{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fine-tune Embedding models for Retrieval Augmented Generation (RAG)\n",
    "\n",
    "Embedding models are crucial for successful RAG applications, but they're often trained on general knowledge, which limits their effectiveness for company or domain specific adoption. Customizing embedding for your domain specific data can significantly boost the retrieval performance of your RAG Application. With the new release of [Sentence Transformers 3](https://huggingface.co/blog/train-sentence-transformers), it's easier than ever to fine-tune embedding models. \n",
    "\n",
    "In this blog, we'll show you how to fine-tune an embedding model for a financial RAG applications using a synthetic dataset from the [2023_10 NVIDIA SEC Filing](https://stocklight.com/stocks/us/nasdaq-nvda/nvidia/annual-reports/nasdaq-nvda-2023-10K-23668751.pdf). We'll also leverage Matryoshka Representation Learning to boost efficiency. In the blog, we are going to: \n",
    "\n",
    "1. Create & Prepare embedding dataset\n",
    "2. Create baseline and evaluate pretrained model \n",
    "3. Define loss function with Matryoshka Representation\n",
    "4. Fine-tune embedding model with `SentenceTransformersTrainer`\n",
    "5. Evaluate fine-tuned model against baseline\n",
    "\n",
    "**🪆 Matryoshka Embeddings**\n",
    "\n",
    "[Matryoshka Representation Learning (MRL)](https://huggingface.co/blog/matryoshka) is a technique designed to create embeddings that can be truncated to various dimensions without significant loss of performance. This approach frontloads important information into earlier dimensions of the embedding, allowing for efficient storage and processing while maintaining high accuracy in downstream tasks such as retrieval, classification, and clustering. \n",
    "\n",
    "For example, a Matryoshka model can preserve ~99.9% of its performance while needing 3x less storage. This is particularly useful for applications where storage and processing resources are limited, such as on-device applications or large-scale retrieval systems.\n",
    "\n",
    "\n",
    "_Note: This blog was created to run on consumer size GPUs (24GB), e.g. NVIDIA A10G or RTX 4090/3090, but can be easily adapted to run on bigger GPUs._\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we can get started we need to install Hugging Face Libraries and Pytorch, including Sentence Transformers, transformers and datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Pytorch & other libraries\n",
    "!pip install \"torch==2.1.2\" tensorboard\n",
    "\n",
    "# Install Hugging Face libraries\n",
    "!pip install --upgrade \\\n",
    "  \"sentence-transformers>=3\" \\\n",
    "  \"datasets==2.19.1\"  \\\n",
    "  \"transformers==4.41.2\" \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. This means we will automatically push our model, logs and information to the Hub during training. You must register on the [Hugging Face](https://huggingface.co/join) for this. After you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on the disk.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "\n",
    "login(token=\"\", add_to_git_credential=True)  # ADD YOUR TOKEN HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Create & Prepare embedding dataset\n",
    "\n",
    "An embedding dataset typically consists of text pairs (question, answer/context) or triplets that represent relationships or similarities between sentences. The dataset format you choose or have available will also impact the loss function you can use. Common formats for embedding datasets:\n",
    "\n",
    "- **Positive Pair**: Text Pairs of related sentences (query, context | query, answer), suitable for tasks like similarity or semantic search, example datasets: `sentence-transformers/sentence-compression`, `sentence-transformers/natural-questions`.\n",
    "- **Triplets**: Text triplets consisting of (anchor, positive, negative), example datasets `sentence-transformers/quora-duplicates`, `nirantk/triplets`.\n",
    "- **Pair with Similarity Score**: Sentence pairs with a similarity score indicating how related they are, example datasets: `sentence-transformers/stsb`, `PhilipMay/stsb_multi_mt`\n",
    "\n",
    "Learn more at [Dataset Overview](https://sbert.net/docs/sentence_transformer/dataset_overview.html).\n",
    "\n",
    "We are going to use [philschmid/finanical-rag-embedding-dataset](https://huggingface.co/datasets/philschmid/finanical-rag-embedding-dataset), which includes 7,000 positive text pairs of questions and corresponding context from the [2023_10 NVIDIA SEC Filing](https://stocklight.com/stocks/us/nasdaq-nvda/nvidia/annual-reports/nasdaq-nvda-2023-10K-23668751.pdf).\n",
    "\n",
    "The dataset has the following format\n",
    "```json\n",
    "{\"question\": \"<question>\", \"context\": \"<relevant context to answer>\"}\n",
    "{\"question\": \"<question>\", \"context\": \"<relevant context to answer>\"}\n",
    "{\"question\": \"<question>\", \"context\": \"<relevant context to answer>\"}\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to load our open-source dataset using the 🤗 Datasets library, rename the colums to match what `sentence-transforemrs` expects and then split our dataset into a train and test spilt to be able to evaluate our model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# Load dataset from the hub\n",
    "dataset = load_dataset(\"philschmid/finanical-rag-embedding-dataset\", split=\"train\")\n",
    "\n",
    "# rename columns\n",
    "dataset = dataset.rename_column(\"question\", \"anchor\")\n",
    "dataset = dataset.rename_column(\"context\", \"positive\")\n",
    "\n",
    "# Add an id column to the dataset\n",
    "dataset = dataset.add_column(\"id\", range(len(dataset)))\n",
    "\n",
    "# split dataset into a 10% test set\n",
    "dataset = dataset.train_test_split(test_size=0.1)\n",
    "\n",
    "# save datasets to disk\n",
    "dataset[\"train\"].to_json(\"train_dataset.json\", orient=\"records\")\n",
    "dataset[\"test\"].to_json(\"test_dataset.json\", orient=\"records\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create baseline and evaluate pretrained model \n",
    "\n",
    "After we created our dataset we want to create a baseline. A baseline provides use reference point against which the performance of your customized model can be measured. By evaluating the performance of a pretrained model on our specific dataset, we gain insights into the initial effectiveness and can identify areas for improvement.\n",
    "\n",
    "For our example, we will use the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) model as our starting point. [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) is one of the strongest open embedding models for it size, with only 109M parameters and a hidden dimension of 768 it achieves `63.55` on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).\n",
    "\n",
    "We are going to use the [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) to evaluate the performance of our model on a given set of queries and corpus set. It will retrieve for each query the top-k most similar document. It measures Mean Reciprocal Rank (MRR), Recall@k, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). \n",
    "\n",
    "For us the most important metric will be Normalized Discounted Cumulative Gain (NDCG) as it is a measure of ranking quality. It takes into account the position of the relevant document in the ranking and discounts it. The discounted value is logarithmic, which means that relevant documents are more important if they are ranked higher.\n",
    "\n",
    "For our evaluation corpus we will use all \"documents\" for potential retrieval from the train and test split and the queries from the test set. As mentioned in the beginning we are going to use Matryoshka Representation Learning to boost efficiency. We are going to create a baseline for the following dimensions `64`, `128`, `256`, `512`, `768`. Since those are the dimensions we are going to use for our Matryoshka Representation Learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from sentence_transformers.evaluation import (\n",
    "    InformationRetrievalEvaluator,\n",
    "    SequentialEvaluator,\n",
    ")\n",
    "from sentence_transformers.util import cos_sim\n",
    "from datasets import load_dataset, concatenate_datasets\n",
    "\n",
    "model_id = \"BAAI/bge-base-en-v1.5\"  # Hugging Face model ID\n",
    "matryoshka_dimensions = [768, 512, 256, 128, 64] # Important: large to small\n",
    "\n",
    "# Load a model\n",
    "model = SentenceTransformer(\n",
    "    model_id, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    ")\n",
    "\n",
    "# load test dataset\n",
    "test_dataset = load_dataset(\"json\", data_files=\"test_dataset.json\", split=\"train\")\n",
    "train_dataset = load_dataset(\"json\", data_files=\"train_dataset.json\", split=\"train\")\n",
    "corpus_dataset = concatenate_datasets([train_dataset, test_dataset])\n",
    "\n",
    "# Convert the datasets to dictionaries\n",
    "corpus = dict(\n",
    "    zip(corpus_dataset[\"id\"], corpus_dataset[\"positive\"])\n",
    ")  # Our corpus (cid => document)\n",
    "queries = dict(\n",
    "    zip(test_dataset[\"id\"], test_dataset[\"anchor\"])\n",
    ")  # Our queries (qid => question)\n",
    "\n",
    "# Create a mapping of relevant document (1 in our case) for each query\n",
    "relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids])\n",
    "for q_id in queries:\n",
    "    relevant_docs[q_id] = [q_id]\n",
    "\n",
    "\n",
    "matryoshka_evaluators = []\n",
    "# Iterate over the different dimensions\n",
    "for dim in matryoshka_dimensions:\n",
    "    ir_evaluator = InformationRetrievalEvaluator(\n",
    "        queries=queries,\n",
    "        corpus=corpus,\n",
    "        relevant_docs=relevant_docs,\n",
    "        name=f\"dim_{dim}\",\n",
    "        truncate_dim=dim,  # Truncate the embeddings to a certain dimension\n",
    "        score_functions={\"cosine\": cos_sim},\n",
    "    )\n",
    "    matryoshka_evaluators.append(ir_evaluator)\n",
    "\n",
    "# Create a sequential evaluator\n",
    "evaluator = SequentialEvaluator(matryoshka_evaluators)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use the `evaluator` also for evaluation during training. But first lets create our baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate the model\n",
    "results = evaluator(model)\n",
    "\n",
    "# # COMMENT IN for full results\n",
    "# print(results)\n",
    "\n",
    "# Print the main score\n",
    "for dim in matryoshka_dimensions:\n",
    "    key = f\"dim_{dim}_cosine_ndcg@10\"\n",
    "    print\n",
    "    print(f\"{key}: {results[key]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great our default baseline with the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) are:\n",
    "- dim 768: `0.7683576219287339`\n",
    "- dim 512: `0.7642500951356254`\n",
    "- dim 256: `0.7546468566985919`\n",
    "- dim 128: `0.7233663127615272`\n",
    "- dim 64: `0.6439999058834552`\n",
    "\n",
    "Now, let's see if we can improve this score by fine-tuning the model on our specific dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Define loss function with Matryoshka Representation\n",
    "\n",
    "When fine-tuning embedding models we select a [loss function based on our dataset format](https://sbert.net/docs/sentence_transformer/loss_overview.html). For Positive Text pairs we can use the `MultipleNegativesRankingLoss` in combination with the `MatryoshkaLoss`. This approach allows us to leverage the efficiency and flexibility of Matryoshka embeddings, enabling different embedding dimensions to be utilized without significant performance trade-offs. The `MultipleNegativesRankingLoss` is a great loss function if you only have positive pairs as it adds in batch negative samples to the loss function to have per sample n-1 negative samples. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets reload our model using `SDPA` or Flash Attention 2 as `attn_implementation` and define a model card. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformerModelCardData, SentenceTransformer\n",
    "\n",
    "# Hugging Face model ID: https://huggingface.co/BAAI/bge-base-en-v1.5\n",
    "model_id = \"BAAI/bge-base-en-v1.5\"\n",
    "\n",
    "# load model with SDPA for using Flash Attention 2\n",
    "model = SentenceTransformer(\n",
    "    model_id,\n",
    "    model_kwargs={\"attn_implementation\": \"sdpa\"},\n",
    "    model_card_data=SentenceTransformerModelCardData(\n",
    "        language=\"en\",\n",
    "        license=\"apache-2.0\",\n",
    "        model_name=\"BGE base Financial Matryoshka\",\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we loaded our model we can initialize our loss function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss\n",
    "\n",
    "matryoshka_dimensions = [768, 512, 256, 128, 64]  # Important: large to small\n",
    "inner_train_loss = MultipleNegativesRankingLoss(model)\n",
    "train_loss = MatryoshkaLoss(\n",
    "    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Fine-tune embedding model with `SentenceTransformersTrainer`\n",
    "\n",
    "We are now ready to fine-tune our model. We will use the [SentenceTransformersTrainer](https://sbert.net/docs/package_reference/sentence_transformer/trainer.html#sentencetransformertrainer) a subclass of the `Trainer` from the `transformers` library, which supports all the same features, including logging, evaluation, and checkpointing. \n",
    "\n",
    "In addition to this there is a `SentenceTransformerTrainingArguments` class that allows us to specify all the training parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformerTrainingArguments\n",
    "from sentence_transformers.training_args import BatchSamplers\n",
    "\n",
    "# load train dataset again\n",
    "train_dataset = load_dataset(\"json\", data_files=\"train_dataset.json\", split=\"train\")\n",
    "\n",
    "# define training arguments\n",
    "args = SentenceTransformerTrainingArguments(\n",
    "    output_dir=\"bge-base-financial-matryoshka\", # output directory and hugging face model ID\n",
    "    num_train_epochs=4,                         # number of epochs\n",
    "    per_device_train_batch_size=32,             # train batch size\n",
    "    gradient_accumulation_steps=16,             # for a global batch size of 512\n",
    "    per_device_eval_batch_size=16,              # evaluation batch size\n",
    "    warmup_ratio=0.1,                           # warmup ratio\n",
    "    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value\n",
    "    lr_scheduler_type=\"cosine\",                 # use constant learning rate scheduler\n",
    "    optim=\"adamw_torch_fused\",                  # use fused adamw optimizer\n",
    "    tf32=True,                                  # use tf32 precision\n",
    "    bf16=True,                                  # use bf16 precision\n",
    "    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch\n",
    "    eval_strategy=\"epoch\",                      # evaluate after each epoch\n",
    "    save_strategy=\"epoch\",                      # save after each epoch\n",
    "    logging_steps=10,                           # log every 10 steps\n",
    "    save_total_limit=3,                         # save only the last 3 models\n",
    "    load_best_model_at_end=True,                # load the best model when training ends\n",
    "    metric_for_best_model=\"eval_dim_128_cosine_ndcg@10\",  # Optimizing for the best ndcg@10 score for the 128 dimension\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have every building block we need to create our `SentenceTransformersTrainer` to start then training our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformerTrainer\n",
    "\n",
    "trainer = SentenceTransformerTrainer(\n",
    "    model=model, # bg-base-en-v1\n",
    "    args=args,  # training arguments\n",
    "    train_dataset=train_dataset.select_columns(\n",
    "        [\"anchor\", \"positive\"]\n",
    "    ),  # training dataset\n",
    "    loss=train_loss,\n",
    "    evaluator=evaluator,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start training our model by calling the `train()` method on our `SentenceTransformerTrainer` instance. This will start the training loop and train our model for 4 epochs. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# start training, the model will be automatically saved to the hub and the output directory\n",
    "trainer.train()\n",
    "\n",
    "# save the best model\n",
    "trainer.save_model()\n",
    "\n",
    "# push model to hub\n",
    "trainer.model.push_to_hub(\"bge-base-financial-matryoshka\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training with Flash Attention (SDPA) for 4 epochs on 6.3k samples took ~00:03:26 on a `g5.2xlarge`. The instance costs `1.212$/h` which brings us to a total cost of only `0.07$` for the training. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Evaluate fine-tuned model against baseline\n",
    "\n",
    "We evaluated our model during training, but we also want to evaluate it against our baseline at the end. We use the same `InformationRetrievalEvaluator` to evaluate the performance of our model on a given set of queries and corpus set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "\n",
    "fine_tuned_model = SentenceTransformer(\n",
    "    args.output_dir, device=\"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    ")\n",
    "# Evaluate the model\n",
    "results = evaluator(fine_tuned_model)\n",
    "\n",
    "# # COMMENT IN for full results\n",
    "# print(results)\n",
    "\n",
    "# Print the main score\n",
    "for dim in matryoshka_dimensions:\n",
    "    key = f\"dim_{dim}_cosine_ndcg@10\"\n",
    "    print(f\"{key}: {results[key]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our fine-tuned model achieves:\n",
    "- dim 768: `0.8253652077429072`\n",
    "- dim 512: `0.8274643684492735`\n",
    "- dim 256: `0.8230326345640168`\n",
    "- dim 128: `0.8184049256658124`\n",
    "- dim 64: `0.7892150398663841`\n",
    "\n",
    "Lets put this into a table and compare the performance of our fine-tuned model against the baseline.\n",
    "\n",
    "| Dimension | Baseline | Fine-tuned | Improvement |\n",
    "|-----------|----------|------------| ----------- |\n",
    "| 768       | 0.7684 | 0.8254 | 7.42% |\n",
    "| 512       | 0.7643 | 0.8275 | 8.27% |\n",
    "| 256       | 0.7546 | 0.8230 | 9.06% |\n",
    "| 128       | 0.7234 | 0.8184 | 13.13% |\n",
    "| 64        | 0.6440 | 0.7892 | 22.55% |\n",
    "\n",
    "Lets try to understand this: \n",
    "* Fine-tuned model outperforms the baseline model in all dimensions. \n",
    "* BGE base is already a very strong base model but fine-tuning it on only 6.3k samples still improves performance by ~7.4%.\n",
    "* Matryoshka representation learning works and keeps 95% performance at 64 dimensions (12x size reduction), 99% at 128 dimensions (7x size reduction), >99.5% at 256 dimensions (3x size reduction). \n",
    "* The Fine-tuned model with lowest dimension (64) outperforms the baseline model with highest dimensions (768). \n",
    "* Optimizing the model dimension of 128 allows us to keep 99% of the performance of the 768 dimension model but reducing the storing cost by 6x and improving the cosine similarity search. \n",
    "* The fine-tuned model with 128 dimensions outperforms the baseline model with 768 dimensions by 6.51%, while being 6x smaller.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "Embedding models are crucial for successfull RAG applications, since if you don't retrieve the right context you can't generate the right answer. Customizing embedding models for domain-specific data can improve retrieval performance significantly compared to using general knowledge models. Fine-tuning embedding models has become highly accessible, and using synthetic data generated by LLMs, one can easily customize models for specific needs, resulting in substantial improvements.\n",
    "\n",
    "Our results show that fine-tuning can boost performance by ~7% with only 6.3k sample. The training took 3 minutes on a consumer size GPU and by leveraging modern techniques like Matryoshka Representation Learning we achieved over 99% performance retention with 6x storage reduction and efficiency gains."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
